May 14th 2020

Introduction to the study

  • Raw data
## # A tibble: 242 x 100
##   Snake Reference Note  `SVMP (Snake Ve… `PI-SVMP (Snake… `PII-SVMP (Snak…
##   <chr> <chr>     <chr>            <dbl>            <dbl>            <dbl>
## 1 Agki… https://… Mexi…             24.5                0                0
## 2 Agki… https://… Cost…             30.8                0                0
## 3 Agki… https://… Mexi…             30.6                0                0
## 4 Agki… https://… Orig…             32.5                0                0
## # … with 238 more rows, and 94 more variables: `PIII-SVMP (Snake Venom
## #   Metalloproteinase PIII), %` <dbl>, …
  • Newly found data
## # A tibble: 27 x 4
##   Toxin               `Vipera aspis asp… `Vipera berus ber… `Vipera anatolica s…
##   <chr>                            <dbl>              <dbl>                <dbl>
## 1 SVMP (Snake Venom …               13.4                 NA                 42.9
## 2 3Ftx (three-finger…               NA                   NA                 NA  
## 3 Unknown peptides                  NA                   NA                 23.5
## 4 PLA2 (Phospholipas…               30.9                 NA                  8.2
## # … with 23 more rows

Goal of study

  • Develop a tool for venom composition analysis
  • Group snakes by family based on venom composition (PCA, K-means, ANN)

Project outline

  • Loading and cleaning data
    • Map locations to country
  • Augmentation of data
    • Merge datasets
    • Create genus and species columns
    • Group toxins
  • Analysis and visualisations
    • Geographical and genus distribution
    • Venom composition analysis
  • Unsupervised analysis
    • PCA
    • K-means clustering
  • Supervised classification model
    • Artificial Neural Network (ANN)

Materials and methods

  • Data processing, modelling, and the creation of this presentation were performed in RStudio Cloud.

  • Coding followed the tidyverse style guide by Hadley Wickham.

  • The Artificial Neural Network (ANN) modelling was performed in a separate project.

  • The whole project is available on GitHub: https://github.com/rforbiodatascience/2020_group04

Used packages: httr, readxl, tidyverse, knitr, plotly, maps, patchwork, shiny, rsconnect, keras, devtools

Tidying and transforming data

  • Tidy raw data
    • Load and clean data
  • Transform data
    • Join new data
    • Group toxins
    • Remove toxins found in fewer than five snakes
    • Map genus to snake family
## # A tibble: 233 x 43
##   Snake Genus Species Reference Country Family SVMPi `DC-fragment` CRISP `3Ftx`
##   <chr> <chr> <chr>   <chr>     <chr>   <chr>  <dbl>         <dbl> <dbl>  <dbl>
## 1 Agki… Agki… biline… https://… Mexico  Viper…     0           0    0         0
## 2 Agki… Agki… biline… https://… Costa … Viper…     0           0    0         0
## 3 Agki… Agki… biline… https://… Mexico  Viper…     0           0    5.6       0
## 4 Agki… Agki… contor… https://… Unknown Viper…     0           0.1  3.7       0
## 5 Agki… Agki… contor… https://… USA     Viper…     0           0    1.96      0
## 6 Agki… Agki… contor… https://… USA     Viper…     0           0    0         0
## 7 Agki… Agki… contor… https://… USA     Viper…     0           0    1.9       0
## # … with 226 more rows, and 33 more variables: PLB <dbl>, …
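
The genus/species split and the toxin filter listed above could look roughly like the sketch below. It is not the project's actual code: the toy tibble, the placeholder toxin names (`ToxinA` to `ToxinC`), and the reading of "found" as "non-zero abundance in a sample" are assumptions made for illustration.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the cleaned data: one row per venom sample,
# one column per toxin (relative abundance in %). All names are placeholders.
venom <- tibble(
  Snake  = c("Genus1 alpha", "Genus1 beta", "Genus2 gamma delta",
             "Genus2 gamma", "Genus3 epsilon", "Genus3 zeta"),
  ToxinA = c(24.5, 30.8, 0, 12.1, 5.0, 7.3),
  ToxinB = c(0, 0, 3.2, 0, 0, 0),
  ToxinC = c(1.0, 0, 2.5, 4.4, 0.3, 9.9)
)
toxin_cols <- c("ToxinA", "ToxinB", "ToxinC")
min_snakes <- 5  # threshold taken from the slide

# First word becomes Genus, the rest (species and any subspecies) becomes Species
venom_aug <- venom %>%
  separate(Snake, into = c("Genus", "Species"), sep = " ",
           extra = "merge", remove = FALSE)

# Count the samples in which each toxin has a non-zero abundance (assumed meaning of "found")
n_found <- colSums(venom_aug[toxin_cols] > 0, na.rm = TRUE)
rare_toxins <- names(n_found)[n_found < min_snakes]

# Drop the toxins that fall below the threshold
venom_filtered <- venom_aug %>%
  select(-all_of(rare_toxins))
```

Keeping the original `Snake` column (`remove = FALSE`) preserves the full name for later joins and plot labels.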

Analysis and visualisations

Geographical overview of samples

Snakes from richer countries or countries with a focus on snake research are overrepresented.

Genus distribution according to family

Venom composition in snake families

Toxin abundances

Comparing venom composition between snake species

Comparing venom composition within species

Shiny app

Unsupervised and supervised learning

Results from PCA and K-means
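
For readers unfamiliar with the workflow, a minimal base R sketch of PCA followed by K-means is given here. The data are simulated, and the choices of scaling, two principal components, and two clusters are assumptions for illustration, not the settings used in the project.

```r
set.seed(1)

# Simulated composition matrix: 20 samples x 4 placeholder toxin groups,
# drawn from two clearly different profiles (not real venom data)
comp <- rbind(
  matrix(rnorm(40, mean = rep(c(40, 5, 30, 10), each = 10), sd = 2), nrow = 10),
  matrix(rnorm(40, mean = rep(c(5, 60, 10, 5), each = 10), sd = 2), nrow = 10)
)
colnames(comp) <- c("ToxinA", "ToxinB", "ToxinC", "ToxinD")

# PCA on centred and scaled abundances
pca <- prcomp(comp, center = TRUE, scale. = TRUE)
scores <- pca$x[, 1:2]  # keep the first two principal components

# K-means on the PC scores; nstart > 1 reduces sensitivity to the random start
km <- kmeans(scores, centers = 2, nstart = 25)

# Cross-tabulate cluster assignment against the simulated group
group <- rep(c("profile1", "profile2"), each = 10)
print(table(cluster = km$cluster, group = group))
```

With real data, the cross-tabulation would compare cluster membership against snake family instead of the simulated group.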

Prediction model based on venom composition

A simple vanilla ANN correctly classified the entire test set (25% of the data).

  • Specifications: 4 hidden neurons, learning rate = 0.001, n_epochs = 100, loss criterion = binary cross-entropy.
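
A network with these specifications might be defined with the keras package roughly as follows. Only the four hidden neurons, the learning rate, the number of epochs, and the loss come from the slide; the activation functions, the Adam optimiser, the one-hot encoded output, and the helper name `build_family_classifier` are assumptions.

```r
# Hyperparameters quoted on the slide
hyper <- list(
  hidden_units  = 4,
  learning_rate = 0.001,
  epochs        = 100,
  loss          = "binary_crossentropy"
)

# Hypothetical helper: one hidden layer, one output unit per family.
# Requires the keras package with a working TensorFlow backend when called.
build_family_classifier <- function(n_features, n_classes, hp = hyper) {
  model <- keras::keras_model_sequential()
  model <- keras::layer_dense(model, units = hp$hidden_units,
                              activation = "relu",      # assumed activation
                              input_shape = n_features)
  model <- keras::layer_dense(model, units = n_classes,
                              activation = "sigmoid")   # assumed activation
  keras::compile(
    model,
    loss = hp$loss,
    # Older keras releases name this argument `lr` instead of `learning_rate`
    optimizer = keras::optimizer_adam(learning_rate = hp$learning_rate),
    metrics = "accuracy"
  )
  model
}

# Usage, assuming x_train (numeric matrix) and y_train (one-hot matrix) exist:
# model <- build_family_classifier(ncol(x_train), ncol(y_train))
# keras::fit(model, x_train, y_train, epochs = hyper$epochs)
```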

Theoretical analysis of incorrect labels

To investigate a case with misclassified snakes, a new model was trained with a test size of 40%. Five snakes were misclassified, as illustrated below:

Analysis of special cases

Snakes incorrectly labelled by the sub-optimal ANN:

## # A tibble: 5 x 2
##   Snake                    Family   
##   <chr>                    <chr>    
## 1 Daboia russelii russelii Viperidae
## 2 Hydrophis cyanocinctus   Elapidae 
## 3 Micropechis ikaheka      Elapidae 
## 4 Naja kaouthia            Elapidae 
## 5 Naja kaouthia            Elapidae
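
A table of this shape can be produced by comparing predicted and true family labels on the test set. The sketch below uses placeholder snake names and an assumed `Predicted` column; it is not the project's code.

```r
library(dplyr)

# Placeholder test set: true family and the family predicted by the model
test_set <- tibble(
  Snake     = c("Snake A", "Snake B", "Snake C", "Snake D"),
  Family    = c("Viperidae", "Elapidae", "Elapidae", "Viperidae"),
  Predicted = c("Viperidae", "Viperidae", "Elapidae", "Elapidae")
)

# Keep only the rows where the prediction disagrees with the true label
misclassified <- test_set %>%
  filter(Predicted != Family) %>%
  select(Snake, Family)

print(misclassified)
```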

Snake from K-means cluster 2:

## # A tibble: 1 x 2
##   Snake             Family  
##   <chr>             <chr>   
## 1 Bungarus candidus Elapidae

Shiny app

Static plots for publication

Comparing venom composition between snake species

Comparing venom composition within species